Instructions

Your final is due by the end of the last week of class. You should post your solutions to your GitHub account or RPubs. You are also expected to make a short presentation via YouTube and post that recording to the board. This project will show off your ability to understand the elements of the class.

Problem 1


Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with mean and standard deviation \(\mu=\sigma=\frac{(N+1)}{2}\).


Following the instructions from above, here is my R syntax to generate a random variable X:

set.seed(123)

N <- 10
X <- runif(10000, 1, N)
Y <- rnorm(10000, (N+1)/2, (N+1)/2)

From the above syntax, you can see that I’ve created a variable \(X\) with 10,000 random uniform numbers from 1 to \(N\) (\(N=10\)). Additionally, I’ve created a random variable \(Y\) that has 10,000 random normal numbers with a \(\mu\) and \(\sigma\) of \(\frac{N+1}{2}\).
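As a quick sanity check (optional, not part of the assignment), the sample moments of the simulated variables should sit near their theoretical values:

```r
# Quick sanity check: with 10,000 draws, the sample moments should land
# close to the theoretical values of (N + 1) / 2 = 5.5
set.seed(123)
N <- 10
X <- runif(10000, 1, N)
Y <- rnorm(10000, (N + 1) / 2, (N + 1) / 2)

c(mean_X = mean(X), mean_Y = mean(Y), sd_Y = sd(Y))  # all ~ 5.5
```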


Probability: Calculate as a minimum the below probabilities a through c. Assume the small letter \(x\) is estimated as the median of the \(X\) variable, and the small letter \(y\) is estimated as the 1st quartile of the \(Y\) variable. Interpret the meaning of all probabilities.


To work through this, I’ll first have to calculate \(x\) (the median) and \(y\) (the first quartile). Additionally, I’ve saved \(X\) and \(Y\) in a dataframe and stored the total number of rows (10,000) in a variable:

x <- quantile(X, 0.50)
y <- quantile(Y, 0.25)
df <- data.frame(X = X, Y = Y)
total_rows <- nrow(df)

a) \(P(X>x \ | \ X>y)\)


We can use the following equation to find the probability:

\[P(A \ | \ B) = \frac{P(A \ \cap \ B)}{P(B)} \\ where \ A = \{X>x\} \ and \ B = \{X>y\}\]

With this in mind, we can solve:

P_AandB <- nrow(df %>% filter(X > x & X > y)) / total_rows
P_B <- nrow(df %>% filter(X > y)) / total_rows

P_AgivenB <- P_AandB / P_B
P_AgivenB
## [1] 0.5509642

Solution: The probability \(P(X>x \ | \ X>y) = 0.5510\). In other words, given that a uniform draw from 1 to 10 exceeds \(y = 1.84\) (the first quartile of \(Y\)), the probability that it also exceeds the median \(x = 5.45\) is 0.5510.
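Because \(y\) falls below \(x\) here, the event \(\{X > x\}\) is a subset of \(\{X > y\}\), so the same probability reduces to a ratio of two tail proportions — a quick cross-check of the dplyr calculation (the data is regenerated so this chunk stands alone):

```r
# Since y (~1.84) < x (~5.45), {X > x} lies inside {X > y}, so
# P(X > x | X > y) = P(X > x) / P(X > y)
set.seed(123)
N <- 10
X <- runif(10000, 1, N)
Y <- rnorm(10000, (N + 1) / 2, (N + 1) / 2)
x <- quantile(X, 0.50)
y <- quantile(Y, 0.25)

p_direct <- mean(X > x) / mean(X > y)
p_direct  # ~ 0.551, matching P_AgivenB above
```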


b) \(P(X>x, \ Y>y)\)


We can use the following equation to solve this portion:

\[P(A,B)=P(A \ \cap \ B) = \frac{n(A \ \cap \ B)}{n} \\ where \ A=\{X>x\} \ and \ B=\{Y>y\}\]

With this in mind, we can solve:

P_AandB_2 <- nrow(df %>% filter(X > x & Y > y)) / total_rows
P_AandB_2
## [1] 0.3756

Solution: The probability \(P(X>x, \ Y>y) = 0.3756\): the probability that the uniform draw exceeds the median \(x = 5.45\) and the corresponding normal draw exceeds \(y = 1.84\) is 0.3756.


c) \(P(X<x \ | \ X>y)\)


Similar to part a, we can use the following equation to solve:

\[P(A \ | \ B) = \frac{P(A \ \cap \ B)}{P(B)} \\ where \ A = P(X<x) \ and \ B = P(X>y)\]

With this in mind, we can solve:

P_AandB <- nrow(df %>% filter(X < x & X > y)) / total_rows
P_B <- nrow(df %>% filter(X > y)) / total_rows

P_AgivenB_2 <- P_AandB / P_B
P_AgivenB_2
## [1] 0.4490358

Solution: The probability \(P(X<x \ | \ X>y) = 0.4490\). Given that a uniform draw from 1 to 10 exceeds \(y = 1.84\), the probability that it falls below the median \(x = 5.45\) is 0.4490 — the complement of part a.
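Parts a and c condition on the same event \(\{X>y\}\), and given that event, \(\{X<x\}\) and \(\{X>x\}\) are complementary, so the two probabilities should sum to 1 — a quick check (data regenerated so the chunk stands alone):

```r
set.seed(123)
N <- 10
X <- runif(10000, 1, N)
Y <- rnorm(10000, (N + 1) / 2, (N + 1) / 2)
x <- quantile(X, 0.50)
y <- quantile(Y, 0.25)

# Conditional probabilities from parts (a) and (c)
p_a <- mean(X > x & X > y) / mean(X > y)
p_c <- mean(X < x & X > y) / mean(X > y)
p_a + p_c  # sums to 1, since no simulated value equals the interpolated x exactly
```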


Investigate whether \(P(X>x \ and \ Y>y)=P(X>x)P(Y>y)\) by building a table and evaluating the marginal and joint probabilities.


To do this investigation, we’ll first have to calculate the joint probabilities:

We’ll calculate joint probabilities by using the equation:

\[P(A \ \cap \ B) = \frac{n(A \ \cap \ B)}{n}\] Therefore, we first will create two indicator columns to distinguish whether each observation satisfies \(X > x\) or \(X < x\), and \(Y > y\) or \(Y < y\).

Then, we can take the count of each of the four combinations and divide each count by the total number of rows to get their respective joint probabilities:

Below is the R syntax I used to do this, using the group_by() and summarise() functions from the tidyverse package:

# Joint probabilities
prob_df <- df %>% 
  mutate(A = ifelse(X > x, " X > x", " X < x"), B = ifelse(Y > y, " Y > y", " Y < y")) %>% 
  group_by(A, B) %>% 
  summarise(count = n()) %>% 
  mutate(prob = count / total_rows)

Next, we need to back up a bit and examine the marginal probabilities of \(P(A)\) and \(P(B)\) independently:

Intuitively we know that \(x = median(X)\); therefore, half of the values of \(X\) are greater than \(x\) and the other half are less than \(x\). Similarly, since \(y\) is the first quartile of \(Y\), a quarter of the values of \(Y\) are less than \(y\) and the remaining three-fourths are greater than \(y\). Therefore, the marginal probabilities are \(P(X>x) = P(X<x) = 0.50\), \(P(Y>y) = 0.75\), and \(P(Y<y) = 0.25\).

We can check these assumptions by reworking our table:

prob_df <- prob_df %>% 
  ungroup() %>% 
  group_by(A) %>% 
  summarise(count = sum(count), prob = sum(prob)) %>% 
  mutate(B = "Total Prob") %>% 
  bind_rows(prob_df)

prob_df <- prob_df %>% 
  ungroup() %>% 
  group_by(B) %>% 
  summarise(count = sum(count), prob = sum(prob)) %>% 
  mutate(A = "Total Prob") %>% 
  bind_rows(prob_df)

And by printing our table, we can see that our marginal probabilities that we assumed are confirmed:

prob_df %>%
  select(-count) %>% 
  spread(A, prob) %>%
  rename("Conditions" = B) %>%
  kable() %>%
  kable_styling(bootstrap_options = c('striped'), full_width = FALSE)
Conditions   X < x    X > x    Total Prob
Y < y        0.1256   0.1244   0.25
Y > y        0.3744   0.3756   0.75
Total Prob   0.5000   0.5000   1.00

Solution: After generating the table above, it appears that \(P(X>x \ and \ Y>y)\) is approximately equal to \(P(X>x)P(Y>y)\): the joint probability from the table is 0.3756, while the product of the marginals is \(P(X>x)P(Y>y) = (0.50)(0.75) = 0.375\). The small gap between 0.3756 and 0.375 is sampling variation.
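The same comparison can be made in two lines, without building the full table (data regenerated so the chunk stands alone):

```r
set.seed(123)
N <- 10
X <- runif(10000, 1, N)
Y <- rnorm(10000, (N + 1) / 2, (N + 1) / 2)
x <- quantile(X, 0.50)
y <- quantile(Y, 0.25)

# Joint probability versus product of the marginals
probs <- c(joint = mean(X > x & Y > y),
           product = mean(X > x) * mean(Y > y))
probs  # the two values should be close if X and Y are independent
```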


Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?


In order to run our Fisher’s Exact Test and Chi-Square Test, we’ll first need to find the counts above and at or below each variable’s cutoff, similar to the previous question, and arrange them in a two-by-two table:

X_plus <- nrow(df %>% 
  filter(X > x))

X_lesseq <- nrow(df %>% 
  filter(X <= x))

Y_lesseq <- nrow(df %>% 
  filter(Y <= y))

Y_plus <- nrow(df %>% 
  filter(Y > y))

freq_matrix <- matrix(c(X_plus, X_lesseq, Y_plus, Y_lesseq), 
                      nrow = 2, ncol = 2, byrow = TRUE,
                      dimnames = list(c("X (cutoff x)", "Y (cutoff y)"),
                                c("Above cutoff", "At or below cutoff")))

With our frequency table ready to go, we can see our counts below:

freq_matrix %>% 
  kable() %>%
  kable_styling(bootstrap_options = c('striped'), full_width = FALSE)
             Above cutoff   At or below cutoff
X (cutoff x)         5000                 5000
Y (cutoff y)         7500                 2500

Now, we are ready to run our Fisher’s Exact Test:

fisher.test(freq_matrix)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  freq_matrix
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.3137986 0.3540659
## sample estimates:
## odds ratio 
##  0.3333333

And here is the Chi-Square Test:

chisq.test(freq_matrix)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  freq_matrix
## X-squared = 1332.3, df = 1, p-value < 2.2e-16

When looking at both outputs, we can see that the p-value for each test is far below 0.05, so we reject the null hypothesis in both cases.

Solution:

Given our outputs, independence does not hold when we run the Fisher’s Exact Test and Chi-Square test. We reject the null hypothesis that there is no relationship between X and Y. Since the Fisher’s Exact Test is typically used on smaller sample sizes, and we have a large sample size of 10,000 observations, it would be more appropriate to rely on the output from our Chi-Square Test.
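For reference, a contingency table for an independence test can also be built by cross-classifying each observation by both conditions at once, rather than stacking the marginal counts of X and Y. This is a sketch (data regenerated so the chunk stands alone); note that it tests a different table than the marginal-count matrix above, so its output may differ:

```r
set.seed(123)
N <- 10
X <- runif(10000, 1, N)
Y <- rnorm(10000, (N + 1) / 2, (N + 1) / 2)
x <- quantile(X, 0.50)
y <- quantile(Y, 0.25)

# 2x2 table of joint outcomes: rows are X > x (FALSE/TRUE),
# columns are Y > y (FALSE/TRUE)
joint_table <- table(X > x, Y > y)
joint_table

fisher.test(joint_table)
chisq.test(joint_table)
```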


Instructions

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition: https://www.kaggle.com/c/house-prices-advanced-regression-techniques. I want you to do the following.

Problem 2


Descriptive and Inferential Statistics: Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?


kaggle_df <- read_csv('https://raw.githubusercontent.com/zachalexander/data605_cuny/master/Final%20Exam/train.csv')
# summary(kaggle_df)

On a quick look at the summary, I decided to pull out the numeric values to explore a bit more:

# summary() of a dataframe returns a character table; convert it for reshaping
numerics <- data.frame(summary(kaggle_df))

numerics <- numerics %>% 
  filter(!is.na(Freq))

# Keep only the numeric summaries and split strings like "Median :1087"
numerics <- data.frame(numerics %>% 
  filter(!grepl('Class', Freq), !grepl('Length', Freq), !grepl('Mode', Freq)) %>% 
  separate(Freq, c('Value', 'Count'), sep = ":", remove = TRUE))

numerics$Var2 <- trimws(numerics$Var2)
numerics$Value <- trimws(numerics$Value)
numerics$Count <- round(as.numeric(trimws(numerics$Count)), 1)

numerics <- numerics %>%
  select(Var2, Value, Count) %>%
  spread(Value, Count) %>% 
  rename("Attribute" = "Var2", "Q1" = `1st Qu.`, "Q3" = `3rd Qu.`, "Max" = "Max.", "Min" = "Min.", "NA" = "NA's") %>% 
  select(Attribute, Min, Q1, Median, Mean, Q3, Max, `NA`)

kable(numerics) %>% 
  kable_styling(bootstrap_options = 'striped')
Attribute Min Q1 Median Mean Q3 Max NA
1stFlrSF 334 882.0 1087.0 1163.0 1391.0 4692 NA
2ndFlrSF 0 0.0 0.0 347.0 728.0 2065 NA
3SsnPorch 0 0.0 0.0 3.4 0.0 508 NA
BedroomAbvGr 0 2.0 3.0 2.9 3.0 8 NA
BsmtFinSF1 0 0.0 383.5 443.6 712.2 5644 NA
BsmtFinSF2 0 0.0 0.0 46.5 0.0 1474 NA
BsmtFullBath 0 0.0 0.0 0.4 1.0 3 NA
BsmtHalfBath 0 0.0 0.0 0.1 0.0 2 NA
BsmtUnfSF 0 223.0 477.5 567.2 808.0 2336 NA
EnclosedPorch 0 0.0 0.0 21.9 0.0 552 NA
Fireplaces 0 0.0 1.0 0.6 1.0 3 NA
FullBath 0 1.0 2.0 1.6 2.0 3 NA
GarageArea 0 334.5 480.0 473.0 576.0 1418 NA
GarageCars 0 1.0 2.0 1.8 2.0 4 NA
GarageYrBlt 1900 1961.0 1980.0 1979.0 2002.0 2010 81
GrLivArea 334 1130.0 1464.0 1515.0 1777.0 5642 NA
HalfBath 0 0.0 0.0 0.4 1.0 2 NA
Id 1 365.8 730.5 730.5 1095.2 1460 NA
KitchenAbvGr 0 1.0 1.0 1.0 1.0 3 NA
LotArea 1300 7554.0 9478.0 10517.0 11602.0 215245 NA
LotFrontage 21 59.0 69.0 70.0 80.0 313 259
LowQualFinSF 0 0.0 0.0 5.8 0.0 572 NA
MasVnrArea 0 0.0 0.0 103.7 166.0 1600 8
MiscVal 0 0.0 0.0 43.5 0.0 15500 NA
MoSold 1 5.0 6.0 6.3 8.0 12 NA
MSSubClass 20 20.0 50.0 56.9 70.0 190 NA
OpenPorchSF 0 0.0 25.0 46.7 68.0 547 NA
OverallCond 1 5.0 5.0 5.6 6.0 9 NA
OverallQual 1 5.0 6.0 6.1 7.0 10 NA
PoolArea 0 0.0 0.0 2.8 0.0 738 NA
SalePrice 34900 129975.0 163000.0 180921.0 214000.0 755000 NA
ScreenPorch 0 0.0 0.0 15.1 0.0 480 NA
TotalBsmtSF 0 795.8 991.5 1057.4 1298.2 6110 NA
TotRmsAbvGrd 2 5.0 6.0 6.5 7.0 14 NA
WoodDeckSF 0 0.0 0.0 94.2 168.0 857 NA
YearBuilt 1872 1954.0 1973.0 1971.0 2000.0 2010 NA
YearRemodAdd 1950 1967.0 1994.0 1985.0 2004.0 2010 NA
YrSold 2006 2007.0 2008.0 2008.0 2009.0 2010 NA
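As an aside, the same summary table can be produced without parsing the text output of summary(). This sketch assumes the tidyverse (dplyr ≥ 1.0 and tidyr) is loaded and defines a hypothetical helper, numeric_summary():

```r
library(dplyr)
library(tidyr)

# Summarise every numeric column of a dataframe directly,
# mirroring the Min/Q1/Median/Mean/Q3/Max/NA table above
numeric_summary <- function(data) {
  data %>%
    select(where(is.numeric)) %>%
    pivot_longer(everything(), names_to = "Attribute") %>%
    group_by(Attribute) %>%
    summarise(Min = min(value, na.rm = TRUE),
              Q1 = quantile(value, 0.25, na.rm = TRUE),
              Median = median(value, na.rm = TRUE),
              Mean = mean(value, na.rm = TRUE),
              Q3 = quantile(value, 0.75, na.rm = TRUE),
              Max = max(value, na.rm = TRUE),
              NAs = sum(is.na(value)),
              .groups = "drop")
}
```

Applied to the training data, numeric_summary(kaggle_df) should reproduce the table above (column order and rounding aside).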

With my numerics dataframe ready, I can now set up some visuals by splitting the dataframe into two: one with all quantitative variables, and the other with categorical data:

kaggle_numerics <- kaggle_df[, (colnames(kaggle_df)) %in% numerics$Attribute]
kaggle_categorical <- kaggle_df[, !(colnames(kaggle_df) %in% numerics$Attribute)]
Exploring the quantitative variables
kaggle_numerics %>% 
  dplyr::select(2:11) %>%
  pairs.panels(method = "pearson", hist.col = "#8bb397")

kaggle_numerics %>% 
  dplyr::select(12:21) %>%
  pairs.panels(method = "pearson", hist.col = "#8bb397")

kaggle_numerics %>% 
  dplyr::select(22:31) %>%
  pairs.panels(method = "pearson", hist.col = "#8bb397")

kaggle_numerics %>% 
  dplyr::select(32:38) %>%
  pairs.panels(method = "pearson", hist.col = "#8bb397")

After creating a few plot matrices with our quantitative data, I thought there were a few initial takeaways:

  • Many of the living-space and square-footage values show stronger correlations with one another. This makes sense, since the square footage of one floor is often similar to that of another floor in the same house. For instance, there’s a strong correlation between the first-floor square footage and the basement square footage. When thinking about our regression later, it’s nice to know that there are multiple attributes here that could serve as proxies (if needed).

  • Additionally, I’m seeing a gradual, positive trend in garage size by build year – garages built in the early part of the 20th century tend to be smaller than those built in the later 1900s and early 2000s. This may be something to look into further, along with its relationship with the Sale Price later on.

  • It’s interesting to see, too, that the year the house was built has a relationship with the rating of overall material and finish quality of the house. Again, this makes sense, but it would be valuable to see if this has a relationship with Sale Price.

Now, let’s explore the categorical variables a bit.

Exploring a few of the categorical variables
table(kaggle_categorical$LandSlope)
## 
##  Gtl  Mod  Sev 
## 1382   65   13
# table(kaggle_categorical$BldgType)
table(kaggle_categorical$HouseStyle)
## 
## 1.5Fin 1.5Unf 1Story 2.5Fin 2.5Unf 2Story SFoyer   SLvl 
##    154     14    726      8     11    445     37     65
# table(kaggle_categorical$RoofStyle)
table(kaggle_categorical$RoofMatl)
## 
## ClyTile CompShg Membran   Metal    Roll Tar&Grv WdShake WdShngl 
##       1    1434       1       1       1      11       5       6
# table(kaggle_categorical$ExterCond)
table(kaggle_categorical$CentralAir)
## 
##    N    Y 
##   95 1365
# table(kaggle_categorical$Utilities)
table(kaggle_categorical$HeatingQC)
## 
##  Ex  Fa  Gd  Po  TA 
## 741  49 241   1 428
table(kaggle_categorical$SaleType)
## 
##   COD   Con ConLD ConLI ConLw   CWD   New   Oth    WD 
##    43     2     9     5     5     4   122     3  1267
table(kaggle_categorical$SaleCondition)
## 
## Abnorml AdjLand  Alloca  Family  Normal Partial 
##     101       4      12      20    1198     125
table(kaggle_categorical$Foundation)
## 
## BrkTil CBlock  PConc   Slab  Stone   Wood 
##    146    634    647     24      6      3

After looking through a large number of categorical values, and checking frequencies, a few things stood out to me:

  • It looks like the majority of houses are built on gentle slopes; however, a small proportion are built on moderate or severe slopes – this could be something to look into further when building out the regression later on. It may be interesting to see if there is any correlation here with the Sale Price.

  • Similarly, the foundation material seems to be split heavily between cinder blocks and poured concrete. This is another feature that may be useful to explore later, to see if it has an effect on housing prices, or if there is a connection between the year a house was built and which type of foundation was used.

  • Finally, Sale Type and Sale Condition were interesting values and their frequencies suggest that there isn’t a “one-size-fits-all” approach to certain transactions, making this an enticing set of features to explore further.

Developing a scatterplot matrix

After plotting all of my quantitative variables earlier, I thought I’d hone in on a few that may be useful to explore their relationship with the Sale Price variable. You can find my plot below:

pairs(kaggle_df[, c('1stFlrSF', 'GrLivArea', 'GarageArea', 'TotalBsmtSF', 'YearBuilt', 'SalePrice' )], pch = 19)

Developing a correlation matrix

For our correlation matrix, I will use GrLivArea, 1stFlrSF, and SalePrice. First, I thought it would be interesting to plot these three attributes in a scatterplot matrix:

kaggle_numerics %>% 
  dplyr::select(`1stFlrSF`, `GrLivArea`, SalePrice) %>%
  pairs.panels(method = "pearson", hist.col = "#8bb397")

corr_matrix <- kaggle_numerics %>% 
  select(`1stFlrSF`, GrLivArea, SalePrice) %>% 
  cor() %>% 
  as.matrix()

As we can see from above, all three attributes, the square footage of the first floor, the above grade (ground) living area and the Sale Price all seem to have a positive relationship with one another. All three distributions also appear to be unimodal and right-skewed. Below is the official correlation matrix:

kable(corr_matrix) %>% 
  kable_styling(bootstrap_options = 'striped', full_width = FALSE)
1stFlrSF GrLivArea SalePrice
1stFlrSF 1.0000000 0.5660240 0.6058522
GrLivArea 0.5660240 1.0000000 0.7086245
SalePrice 0.6058522 0.7086245 1.0000000
Hypothesis testing on correlation matrix

Now, although these correlations seem to be quite strong, we can do some hypothesis testing on their relationships at an 80-percent confidence interval:

corr1 <- cor.test(kaggle_numerics$SalePrice, kaggle_numerics$`1stFlrSF`, method = 'pearson', conf.level = 0.80)

corr2 <- cor.test(kaggle_numerics$SalePrice, kaggle_numerics$GrLivArea, method = 'pearson', conf.level = 0.80)

corr3 <- cor.test(kaggle_numerics$`1stFlrSF`, kaggle_numerics$GrLivArea, method = 'pearson', conf.level = 0.80)

corr1
## 
##  Pearson's product-moment correlation
## 
## data:  kaggle_numerics$SalePrice and kaggle_numerics$`1stFlrSF`
## t = 29.078, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5841687 0.6266715
## sample estimates:
##       cor 
## 0.6058522
corr2
## 
##  Pearson's product-moment correlation
## 
## data:  kaggle_numerics$SalePrice and kaggle_numerics$GrLivArea
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6915087 0.7249450
## sample estimates:
##       cor 
## 0.7086245
corr3
## 
##  Pearson's product-moment correlation
## 
## data:  kaggle_numerics$`1stFlrSF` and kaggle_numerics$GrLivArea
## t = 26.217, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5427732 0.5884078
## sample estimates:
##      cor 
## 0.566024

Solution: After running the three pairwise correlation tests between my variables SalePrice, 1stFlrSF, and GrLivArea, I can see from the outputs that all three reject the null hypothesis that the true correlation is zero, in favor of the alternative hypothesis that it is not. I know this to be true since the p-values for all three of my Pearson’s correlation tests are less than 0.05.

Solution: Additionally, since we are dealing with a large sample size and our correlation tests are yielding very small p-values (all less than 0.001), I’m not worried about family-wise error when measuring relationships across these three attributes. Even a Bonferroni correction for three tests (significance threshold \(0.05 / 3 \approx 0.0167\)) would leave all three results significant, so a Type I error is unlikely here.
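To make the family-wise argument concrete, a Bonferroni adjustment can be applied to the three p-values. This sketch uses the reported upper bound of 2.2e-16 as a stand-in for each exact p-value (in the full analysis one would pass corr1$p.value, corr2$p.value, corr3$p.value):

```r
# Bonferroni adjustment: each p-value is multiplied by the number of tests (3).
# 2.2e-16 is used as a stand-in for the exact p-values reported above.
pvals <- c(SalePrice_vs_1stFlrSF  = 2.2e-16,
           SalePrice_vs_GrLivArea = 2.2e-16,
           GrLivArea_vs_1stFlrSF  = 2.2e-16)
adjusted <- p.adjust(pvals, method = "bonferroni")
adjusted  # all three remain far below 0.05
```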


Linear Algebra and Correlation: Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
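A sketch of the requested steps, rebuilding corr_matrix from the values reported above so the chunk stands alone, using solve() for the inverse and the Matrix package for the LU decomposition:

```r
library(Matrix)

# Correlation matrix, rebuilt from the values reported above
corr_matrix <- matrix(c(1.0000000, 0.5660240, 0.6058522,
                        0.5660240, 1.0000000, 0.7086245,
                        0.6058522, 0.7086245, 1.0000000),
                      nrow = 3,
                      dimnames = list(c("1stFlrSF", "GrLivArea", "SalePrice"),
                                      c("1stFlrSF", "GrLivArea", "SalePrice")))

# Precision matrix: the inverse of the correlation matrix; its diagonal
# entries are the variance inflation factors
precision_matrix <- solve(corr_matrix)

round(corr_matrix %*% precision_matrix, 10)  # identity matrix (up to rounding)
round(precision_matrix %*% corr_matrix, 10)  # identity matrix (up to rounding)

# LU decomposition via the Matrix package
lu_decomp <- expand(lu(corr_matrix))
lu_decomp$L  # unit lower-triangular factor
lu_decomp$U  # upper-triangular factor
```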